I. Introduction

Capital Bikeshare (CaBi) was one of the first large public bikeshare systems in the United States, offering bike rentals within the DC metro area. The program offers bikes for rent (by single ride or membership) at nearly 700 stations in DC and Arlington, VA, with more in surrounding jurisdictions. One of the main objectives of bikesharing systems is to let people bike without owning a bicycle, at the cost of a small subscription fee.

Bikesharing has proven to be an ingenious and beneficial initiative. However, rebalancing is the hardest part of running one. The CaBi system has over 10,000 bike docks across 626 stations but only about 4,700 bikes in circulation, leaving roughly 5,300 docks empty; even with optimal distribution, the average station is below 50% capacity. To combat this challenge and ensure bike availability, CaBi has for years worked to “rebalance” stations by moving bikes from full stations to empty ones. As a result, an army of “rebalancers” emerged behind CaBi, each moving 20 to 25 bikes per trip among 200 stations every day. The job is demanding: the average rebalancer moves 70 to 80 bikes a day, while the most competitive move as many as 150. On top of the competition, frustrated riders and extreme weather, rebalancing is also a fight against the clock.

How, then, can we measure rebalancing priority and determine unmet demand? How should we redistribute bikes appropriately and efficiently? Data analysis seems the right tool to leverage. Previous efforts built a tracker that flags which stations are full or empty, but it cannot predict bike demand; it only reports lagging results. Consequently, this analysis aims to explore user patterns and build a model that indicates rebalancing priority.

The task of rebalancing is complex: an endless list of factors influences bikeshare usage patterns, including time of day, day of week, location, amenities, weather, and elevation. This analysis includes these critical features in the model, and also adds time-lag features (from 1 hour up to a full day) to account for serial correlation.

II. Data Loading and Feature Engineering

2.1 Import Bike Share Data

Each quarter, CaBi publishes [downloadable files](https://www.capitalbikeshare.com/system-data) of its trip data. The data have been processed to remove trips taken by staff as they service and inspect the system, trips to/from any of the company’s “test” stations at the warehouses, and any trips lasting less than 60 seconds. Each dataset includes:

* Duration – Duration of trip
* Start Date – Includes start date and time
* End Date – Includes end date and time
* Start Station – Includes starting station name and number
* End Station – Includes ending station name and number
* Bike Number – Includes ID number of bike used for the trip
* Member Type – Indicates whether the user was a “registered” member

For this analysis, we select 2020 trip data from Week 35 to Week 39 and create additional features:

* interval60 – the hour of the start time
* interval15 – the 15-minute interval of the start time
* week – the week number of the trip
* dotw – the day of the week of the trip
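The feature engineering above can be sketched with lubridate and dplyr. This is a minimal sketch, assuming the raw trip file has been read into a data frame dat with a started_at timestamp column; the column name is an assumption based on the 2020 file layout.

```r
# Sketch: derive the time features listed above (assumes `dat$started_at`)
library(lubridate)
library(dplyr)

dat <- dat %>%
  mutate(interval60 = floor_date(as_datetime(started_at), unit = "hour"),       # hour of start time
         interval15 = floor_date(as_datetime(started_at), unit = "15 minutes"), # 15-minute interval
         week = week(interval60),                                               # week number
         dotw = wday(interval60, label = TRUE)) %>%                             # weekday label
  filter(week %in% 35:39)                                                       # keep Week 35 - Week 39
```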

2.2 Import Census Data

This analysis uses data from the 2014-2018 ACS 5-year estimates. The following demographic variables are selected from ACS 2018 for census tracts in Washington, D.C.:

* Total population
* Median household income
* Median age
* White population percentage
* Travel time
* Number of commuters
* Means of transportation
* Total public transportation

Based on these data, we create Percent_White to capture the white population share, Mean_Commute_Time to capture average commute time, and Percent_Taking_Public_Trans to capture the share of commuters using public transportation.
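A sketch of the ACS pull with tidycensus, assuming a Census API key is configured. The ACS table codes below are the standard IDs for the fields listed above, but they are assumptions and should be verified before use; the derived-variable formulas follow the definitions in the text.

```r
# Sketch: pull the ACS 2018 5-year variables and derive the rates described above
# (variable codes are assumed standard ACS table IDs; verify before use)
library(tidycensus)
library(dplyr)

dcCensus <- get_acs(geography = "tract",
                    variables = c("B01003_001",  # total population
                                  "B19013_001",  # median household income
                                  "B01002_001",  # median age
                                  "B02001_002",  # white population
                                  "B08013_001",  # aggregate travel time to work
                                  "B08012_001",  # number of commuters
                                  "B08301_001",  # means of transportation (total)
                                  "B08301_010"), # total public transportation
                    year = 2018, state = "DC", output = "wide") %>%
  rename(Total_Pop = B01003_001E, Med_Inc = B19013_001E, Med_Age = B01002_001E,
         White_Pop = B02001_002E, Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E, Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Num_Commuters,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
```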

2.3 Import Weather Data

We import weather data from Ronald Reagan Washington National Airport (station code DCA) using the riem_measures function from the riem package, derive Temperature, Precipitation and Wind_Speed on an hourly basis, and plot each trend from Week 35 to Week 39 of 2020.

weather.Panel <- 
  riem_measures(station = "DCA", date_start = "2020-09-01", date_end = "2020-10-30") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.4 Create Amenity Distance Feature

In this section, we calculate the average nearest-neighbor (“knn”) distance from bike stations to three amenities imported from the DC Open Data portal:

* DC metro stations: points representing Metro facilities in DC, 40 records in total.
* DC grocery stores: a point feature class of grocery stores in the District, 79 records in total.
* DC valet parking lots: a point feature class of locations and businesses permitted to offer valet parking, 1,987 records in total.

nn_function <- function(measureFrom, measureTo, k) {
  # Average distance from each point in measureFrom to its k nearest
  # neighbors in measureTo; requires FNN (get.knnx), dplyr, tidyr and tibble
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()

  return(output)
}

# DC bike data

bike <- st_read('https://opendata.arcgis.com/datasets/a1f7acf65795451d89f0a38565a975b3_5.geojson') %>%
  st_transform('ESRI:102685') 
## Reading layer `Capital_Bike_Share_Locations' from data source `https://opendata.arcgis.com/datasets/a1f7acf65795451d89f0a38565a975b3_5.geojson' using driver `GeoJSON'
## Simple feature collection with 596 features and 17 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -77.36842 ymin: 38.78264 xmax: -76.82554 ymax: 39.12601
## geographic CRS: WGS 84
# DC metro station data
metro <- st_read('https://opendata.arcgis.com/datasets/54018b7f06b943f2af278bbe415df1de_52.geojson') %>%
  na.omit() %>%
  st_transform('ESRI:102685') 
## Reading layer `Metro_Stations_in_DC' from data source `https://opendata.arcgis.com/datasets/54018b7f06b943f2af278bbe415df1de_52.geojson' using driver `GeoJSON'
## Simple feature collection with 40 features and 7 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -77.085 ymin: 38.84567 xmax: -76.93526 ymax: 38.97609
## geographic CRS: WGS 84
# DC grocery stores data
grocery <- st_read('https://opendata.arcgis.com/datasets/1d7c9d0e3aac49c1aa88d377a3bae430_4.geojson') %>%
  na.omit() %>%
  st_transform('ESRI:102685') 
## Reading layer `Grocery_Store_Locations' from data source `https://opendata.arcgis.com/datasets/1d7c9d0e3aac49c1aa88d377a3bae430_4.geojson' using driver `GeoJSON'
## Simple feature collection with 79 features and 27 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -77.09596 ymin: 38.84495 xmax: -76.94746 ymax: 38.98446
## geographic CRS: WGS 84
# DC valet parking lot data
parking <- st_read('https://opendata.arcgis.com/datasets/378d0c5205234154afe195de735e2e02_3.geojson') %>%
  st_transform('ESRI:102685') 
## Reading layer `Valet_Parking' from data source `https://opendata.arcgis.com/datasets/378d0c5205234154afe195de735e2e02_3.geojson' using driver `GeoJSON'
## Simple feature collection with 1994 features and 22 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -77.10232 ymin: 38.82391 xmax: -76.91573 ymax: 38.97851
## geographic CRS: WGS 84
# engineer KNN features
bike <-
  bike %>% 
  mutate(
    metro_nn1 = nn_function(st_coordinates(st_centroid(bike)), st_coordinates(st_centroid(metro)), 1),
    metro_nn2 = nn_function(st_coordinates(st_centroid(bike)), st_coordinates(st_centroid(metro)), 2),
    grocery_nn1 = nn_function(st_coordinates(st_centroid(bike)), st_coordinates(st_centroid(grocery)), 1),
    grocery_nn2 = nn_function(st_coordinates(st_centroid(bike)), st_coordinates(st_centroid(grocery)), 2),
    parking_nn1 = nn_function(st_coordinates(st_centroid(bike)), st_coordinates(st_centroid(parking)), 1)) %>%
  dplyr::select(NUMBER_OF_BIKES,NUMBER_OF_EMPTY_DOCKS,ID,geometry,metro_nn1,metro_nn2,grocery_nn1,grocery_nn2,parking_nn1) %>%
  rename(start_station_id = ID)

2.5 Create Full Space-Time Panel

Here, we create the full panel by summarizing trip counts by station for each time interval, keeping census information and lat/lon coordinates for later joins to other data. We remove time intervals with missing values.

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station_id = unique(dat_census$start_station_id)) %>%
  left_join(., dat_census %>%
              dplyr::select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, dcCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

# Integrate built features into whole dataset
ride.panel <- merge(x = ride.panel, y = bike, by = 'start_station_id',
                    all.y = TRUE) %>%
  filter(start_station_id > 12) %>%
  na.omit()

2.6 Create Time Lags

Creating time-lag variables adds nuance about demand during a given period. This analysis creates lagHour, lag2Hours, lag3Hours, lag4Hours, lag12Hours, and lag1day to show trip counts from one hour to one day before each interval.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

2.7 Split Training Set and Test Set
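Following the scheme described in Section IV (training on weeks 37-39, testing on the preceding weeks 35-36), a minimal sketch of the split:

```r
# Week-based train/test split on the space-time panel
ride.Train <- dplyr::filter(ride.panel, week >= 37)
ride.Test  <- dplyr::filter(ride.panel, week < 37)
```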

III. Exploratory Analysis

This section examines serial autocorrelation, spatial autocorrelation, combined space/time patterns, and correlations with weather features.

3.1 Serial Autocorrelation

3.1.1 Bikeshare count by week

Figure 3.1.1 indicates that, on average, daily bikeshare trips are fewer than two. Trip numbers fluctuate across the hours of each day, showing peaks and ebbs. Notably, the test set’s trip counts are on average lower than the training set’s.

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 



rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>% 
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        geom_vline(data = mondays, aes(xintercept = monday), linetype = "dotted", color = '#596a71') +
        labs(title="Bikeshare trips by week in DC: Week 35- Week 39",
             subtitle="Dotted Lines for Each Monday\n",
             caption = 'Figure 3.1.1',
             x="Day", y="Trip Count") +
        plotTheme() 

3.1.2 Bike share trips on weekend vs weekday, DC

Figure 3.1.2.1 plots bikeshare trip counts by day of the week. Monday through Friday generally follow the same trend: a small peak around 8:00-9:00, a distinct peak around 17:00-19:00, and then a decline until midnight. On Saturday and Sunday, trips mostly occur between 10:00 and 17:00. Figure 3.1.2.2 shows the weekday vs. weekend patterns more clearly.

ggplot(dat_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  scale_color_manual(values = palette7,
                          name = 'Weekday') +
  labs(title="Bike share trips in DC, by day of the week, Sept & Oct, 2020\n",
       caption = 'Figure 3.1.2.1',
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))) +
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
    scale_color_manual(values = palette2,
                          name = 'Period') +
  labs(title="Bike share trips in DC - weekend vs weekday, Week 35 - Week 39, 2020\n",
       caption = 'Figure 3.1.2.2',
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

3.1.3 Mean Number of Hourly Trips Per Station

Figure 3.1.3 shows average hourly bikeshare trips per station in different periods of the day. Consistent with the findings above, more trips occur after 10:00. Specifically, mid-day (10:00-15:00) and the PM rush (15:00-18:00) are more active than the AM rush (7:00-10:00) and the overnight hours (after 18:00 and before 7:00), when zero or one trip occurred at most stations.

dat_census %>%
  filter(start_station_id <= 626) %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), fill = '#596a71',binwidth = 0.5)+
  labs(title="Mean Number of Hourly Trips Per Station. DC, Sept & Oct, 2020\n",
       caption = 'Figure 3.1.3',
       x="Trip Count", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

3.1.4 Time Lag Features Exploration

Overall, the correlations between the time lags and trip counts are weak. The strongest is the 2-hour lag, with r = 0.11, followed by the 1-hour and 3-hour lags with r = 0.05. The coefficients diminish to roughly 0 for lags of 4 hours or more. Therefore, lag4Hours, lag12Hours and lag1day will be excluded from model building.

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 39) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))


correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  facet_wrap(~Variable) +
  labs(title = "Bikeshare trip counts as a function of time lags",
       subtitle = 'Week 39 in 2020, DC\n',
       caption = 'Figure 3.1.4') +
  plotTheme()

3.2 Spatial Autocorrelation

This section creates multiple maps of bikeshare trips per hour by station. We can conclude that, despite trip-number fluctuations during the day, more trips occur in the city center.

ggplot()+
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326))+
  geom_point(data = ride.panel %>%

            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              na.omit(ride.panel$interval60) %>%
              group_by(start_station_id, start_lat, start_lng, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lng, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.4, size = 0.05)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma",
  name = 'Counts')+

  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bikeshare trips per hour by station",
       subtitle = "Week35-Week39, 2020; DC\n",
       caption = 'Figure 3.2.4')+
  mapTheme +
  plotTheme()

3.3 Space/Time Correlation: Animation on Week39

An animated map of hourly bikeshare trips per station in Week 39 is created to show the combined space-time pattern. In most hours, a station records fewer than one trip.

week39 <-
  filter(ride.panel , week == 39 ) 


animation.data <- 
  week39 %>% 
  group_by(interval60, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Count, na.rm=T)) 

dock <- bike %>%
  dplyr::select(start_station_id) %>% 
  filter(start_station_id > 12)
  

animation.data <- 
  merge(x = animation.data, y = dock, by = 'start_station_id',
                    all.y  = TRUE) %>%
  na.omit(animation.data$Trip_Count) 

animation.data <- animation.data %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                           Trip_Count >= 4 & Trip_Count <= 6 ~ "4-6 trips",
                           Trip_Count > 6 ~ "6+ trips")) %>%
    mutate(Trips = fct_relevel(Trips, "0 trips", "1-3 trips", "4-6 trips",
                                      "6+ trips"))

animation <- 
  ggplot() +
  geom_sf(data = dcTracts %>%
          st_transform(crs=4326))+
geom_sf(data = animation.data , aes(color = Trips)) +
  scale_color_manual(values = palette4) + 
    labs(title = "Bikeshare trip counts for Week39 in 2020, DC",
         subtitle = "1 hour intervals: {current_frame}\n",
         caption = 'Figure 3.3') +
  plotTheme() + 
    transition_manual(interval60) +
    mapTheme

gganimate::animate(animation,  duration=30,renderer = gifski_renderer())

3.4 Weather Correlation

As indicated above, this analysis includes three weather-related variables: precipitation, temperature and wind speed. This section presents how each affects bikeshare trips.

During the study period, precipitation in D.C. was limited. Figure 3.4.1.2 shows that, although mean trip counts are small, more precipitation significantly reduces bikeshare trips.

Temperature showed distinct fluctuations across hours and days. Its relation to bikeshare trips appears unintuitive, exhibiting a positive pattern in some weeks but not others, which indicates that fixing time-period features is critical to model building.

Wind speed fluctuated across the hours of each day but showed a similar pattern across days. As shown in Figure 3.4.3.2, wind speed does not appear to affect trip counts much, so we exclude it from the model.

3.4.1 Precipitation

ggplot(ride.panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Precipitation Across Time\n", 
       caption = 'Figure 3.4.1.1',
       x="Hour", y="Precipitation") + 
  plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Precipitation, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE, color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Precipitation by Week\n", 
       caption = 'Figure 3.4.1.2',
         x="Precipitation", y="Mean Trip Count") +
    plotTheme() 

3.4.2 Temperature

ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature Across Time\n", 
       caption = 'Figure 3.4.2.1',
       x="Hour", y="Temperature") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE,color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Temperature by Week\n", 
       caption = 'Figure 3.4.2.2',
         x="Temperature", y="Mean Trip Count") +
    plotTheme() 

3.4.3 Wind Speed

ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed Across Time\n", 
       caption = 'Figure 3.4.3.1',
       x="Hour", y="Wind Speed") + plotTheme()

ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind_Speed = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE,color = 'red') +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as A Function of Wind Speed by Week\n", 
       caption = 'Figure 3.4.3.2',
         x="Wind Speed", y="Mean Trip Count") +
    plotTheme() 

IV. Model Building and Prediction

This analysis trains models on three weeks of data, weeks 37-39, and tests them on the preceding two weeks, weeks 35-36.

This section first fits an initial model, reg0, to rule out features obviously unrelated to bikeshare counts. Based on this initial model, Percent_White, NUMBER_OF_EMPTY_DOCKS, NUMBER_OF_BIKES, parking_nn1, and Wind_Speed are excluded from model building due to irrelevance.

Then, this section estimates four different linear regressions on ride.Train, each with different fixed effects:

  • reg1 focuses on just time, including hour and weekday fixed effects, plus other critical features
  • reg2 focuses on just space, including station fixed effects, plus other critical features
  • reg3 includes both time and space fixed effects.
  • reg4 adds the time lag features: lagHour, lag2Hours, and lag3Hours

Prediction output is saved in nested-format dataset named week_predictions.
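The four specifications might be sketched as below. The exact covariate lists are assumptions based on the features retained in this section, not the fitted models themselves.

```r
# Sketch: four lm() specifications on ride.Train (covariate lists assumed)
reg1 <- lm(Trip_Count ~ hour(interval60) + dotw + Temperature + Precipitation +
             metro_nn1 + grocery_nn1, data = ride.Train)                 # time fixed effects
reg2 <- lm(Trip_Count ~ start_station_name + dotw + Temperature + Precipitation +
             metro_nn1 + grocery_nn1, data = ride.Train)                 # space fixed effects
reg3 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw +
             Temperature + Precipitation + metro_nn1 + grocery_nn1,
           data = ride.Train)                                            # time + space
reg4 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw +
             Temperature + Precipitation + metro_nn1 + grocery_nn1 +
             lagHour + lag2Hours + lag3Hours, data = ride.Train)         # + time lags
```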

V. Model Evaluation and Validation

5.1 MAE Examination by Model

This section calculates the Mean Absolute Error (MAE) on ride.Test for each model for Week 35 and Week 36. Generally, Model 1 performs best, but further examination is needed to compare the models.
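A sketch of how the nested week_predictions data set and its per-model MAE can be assembled with the purrr map pattern; model_pred is a hypothetical helper wrapping predict(), and the column names match those used in the plots below.

```r
# Sketch: nest the test set by week, predict with each model, compute MAE
library(tidyverse)

model_pred <- function(dat, fit) predict(fit, newdata = dat)  # hypothetical helper

week_predictions <- ride.Test %>%
  nest(data = -week) %>%
  mutate(Model1 = map(data, model_pred, fit = reg1),
         Model2 = map(data, model_pred, fit = reg2),
         Model3 = map(data, model_pred, fit = reg3),
         Model4 = map(data, model_pred, fit = reg4)) %>%
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE))
```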

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week\n",
         caption = "Figure 5.1") +
  plotTheme()

5.2 Error Examination

This section examines errors by time, by space, and by space and time combined across the four models. Figure 5.2.1 plots observed against predicted values across time. All models tend to underestimate bikeshare trips at peak times; compared to the others, Model 4 captures more of the fluctuation in counts.

Figure 5.2.2 maps the mean error by station. The four models perform similarly: MAEs are close to 0 in most areas and slightly higher in Southeast DC.

In Figures 5.2.3 through 5.2.6, test-set errors are examined by space and time combined for each model. Generally, MAEs show no distinct pattern across time or space. Model 1 and Model 4 perform somewhat better than the other two.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
  scale_color_manual(values = palette2,
                          name = ' ') +
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "DC; A test set of Week35 - Week36",  x = "Hour", y= "Station Trips",
           caption = "Figure 5.2.1") +
      plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng)) %>%
    dplyr::select(interval60, start_station_id, start_lng, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%

  group_by(Regression, start_station_id, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
        facet_wrap(~Regression, ncol=4) +
  labs(title="Mean Abs Error by Station",
       subtitle = "A test set of Week35 - Week36, DC\n",
       caption = 'Figure 5.2.2')+
  mapTheme +
  plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model1")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 1",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.3") +
  mapTheme +
  plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model2")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 2",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.4") +
  mapTheme +
  plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model3")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 3",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.5") +
  mapTheme +
  plotTheme()

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "Model4")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = dcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lng), max(dat_census$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title = "Error Examination by Space + Time, Model 4",
       subtitle ="Mean Absolute Errors on Test Set\n",
       caption = "Figure 5.2.6") +
  mapTheme +
  plotTheme()

5.3 Cross Validation

This section conducts k-fold cross-validation to test goodness of fit on the whole dataset. To keep computation manageable, we use 50 folds. If a model generalizes well, the errors across folds should cluster tightly together.
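The cross-validation objects used below (`reg1.cv`, `reg4.cv`) can be produced with `caret`. The following is a minimal sketch assuming `caret::train` with a 50-fold control; the outcome name `Trip_Count` and the predictor list are illustrative placeholders, not the exact specification used in this report.

```r
library(caret)
library(lubridate)

# 50-fold cross-validation control shared by all candidate models
fitControl <- trainControl(method = "cv", number = 50)

set.seed(856)

# Sketch: a linear model on time and space fixed effects (illustrative formula)
reg1.cv <- train(Trip_Count ~ start_station_id + hour(interval60) + dotw,
                 data = dat_census,
                 method = "lm",
                 trControl = fitControl,
                 na.action = na.omit)

# Per-fold RMSE, R-squared, and MAE are stored in reg1.cv$resample
```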

Based on the outputs, we move forward with `reg1` and `reg4`: the former shows the smallest MAE, while `reg4` captures most of the count fluctuations across time and space. As shown in Figure 5.3.1 and Figure 5.3.2, however, neither model predicts consistently across folds, indicating that neither generalizes well.
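The "cluster tightly" criterion can be made concrete by summarizing the fold-level MAEs directly from the resample tables. A short sketch, assuming `reg1.cv` and `reg4.cv` are `caret` train objects as elsewhere in this section:

```r
library(dplyr)

# Compare the fold-to-fold spread of MAE for the two candidate models;
# a small standard deviation relative to the mean suggests better generalization
bind_rows(
  mutate(reg1.cv$resample, Model = "Model 1"),
  mutate(reg4.cv$resample, Model = "Model 4")) %>%
  group_by(Model) %>%
  summarize(Mean_MAE = mean(MAE),
            SD_MAE   = sd(MAE),
            CV_MAE   = SD_MAE / Mean_MAE)  # coefficient of variation
```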

kable(reg1.cv$resample) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 5.1 Cross-validation Test on Model 1\n",
           general_title= '\n') %>%
    scroll_box(width = "100%", height = "200px")
RMSE Rsquared MAE Resample
0.0432902 0.0000027 0.0036820 Fold01
0.0283966 0.0000198 0.0026376 Fold02
0.0462800 0.0000206 0.0039300 Fold03
0.0284380 0.0014995 0.0026404 Fold04
0.0400761 0.0001075 0.0034294 Fold05
0.0432971 0.0000009 0.0036874 Fold06
0.0365682 0.0009561 0.0031320 Fold07
0.0432856 0.0000996 0.0036720 Fold08
0.1660848 0.0041453 0.0052287 Fold09
0.0327876 0.0003988 0.0029074 Fold10
0.0462525 0.0002957 0.0039309 Fold11
0.0327631 0.0000201 0.0028853 Fold12
0.1339194 0.0034608 0.0047238 Fold13
0.0400579 0.0006251 0.0034198 Fold14
0.0517433 0.0002007 0.0044814 Fold15
0.0232194 0.0000011 0.0023678 Fold16
0.0542474 0.0000221 0.0047089 Fold17
0.0400997 0.0000202 0.0033959 Fold18
0.0327863 0.0004938 0.0028805 Fold19
0.0400145 0.0037770 0.0034024 Fold20
0.0490689 0.0000280 0.0041848 Fold21
0.1097378 0.0019016 0.0052397 Fold22
0.0490702 0.0000169 0.0042115 Fold23
0.0164158 0.0034220 0.0020901 Fold24
0.0432701 0.0004947 0.0036333 Fold25
0.0432947 0.0000011 0.0036705 Fold26
0.0634006 0.0001641 0.0041918 Fold27
0.0327526 0.0000501 0.0028845 Fold28
0.0432669 0.0006192 0.0036560 Fold29
0.0327311 0.0004012 0.0028681 Fold30
0.0283761 0.0001156 0.0026207 Fold31
0.0462657 0.0000444 0.0039576 Fold32
0.0432267 0.0028893 0.0036694 Fold33
0.0490670 0.0001389 0.0041728 Fold34
0.0327166 0.0012901 0.0028963 Fold35
0.1047338 0.0036110 0.0046792 Fold36
0.0400630 0.0006332 0.0034074 Fold37
0.0401113 0.0001757 0.0033833 Fold38
0.0462876 0.0000277 0.0039394 Fold39
0.0366321 0.0003524 0.0031396 Fold40
0.0366250 0.0000917 0.0031549 Fold41
0.0801214 0.0028616 0.0044392 Fold42
0.0164958 0.0000790 0.0020523 Fold43
0.0400292 0.0023612 0.0034020 Fold44
0.0400992 0.0000171 0.0033756 Fold45
0.0865792 0.0013794 0.0039364 Fold46
0.0400663 0.0003256 0.0034230 Fold47
0.0165011 0.0001120 0.0021057 Fold48
0.0817752 0.0008559 0.0052523 Fold49
0.0365828 0.0003909 0.0031878 Fold50

Table 5.1 Cross-validation Test on Model 1
ggplot(data = reg1.cv$resample) +
  geom_histogram(aes(x = MAE), fill = '#88aab8') +
  labs(title="Distribution of Cross-validation MAE on Model 1",
       subtitle = "K = 50\n",
       caption = "Figure 5.3.1 ") +
  xlab('MAE of Model 1') +
  ylab('Count') +
  plotTheme()

kable(reg4.cv$resample) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 5.2. Cross-validation Test on Model 4\n",
           general_title= '\n') %>%
    scroll_box(width = "100%", height = "200px")
RMSE Rsquared MAE Resample
0.0602215 0.0000369 0.0041594 Fold01
0.0337699 0.0001229 0.0028909 Fold02
0.0297251 0.0000191 0.0026231 Fold03
0.0334278 0.0000697 0.0028920 Fold04
0.0465642 0.0001022 0.0036684 Fold05
0.0300657 0.0000685 0.0026745 Fold06
0.0380862 0.0000514 0.0035067 Fold07
0.0292084 0.2054314 0.0025213 Fold08
0.0497159 0.0001973 0.0042089 Fold09
0.0407484 0.0000840 0.0033157 Fold10
0.0288156 0.0000071 0.0024425 Fold11
0.0334435 0.0000684 0.0027993 Fold12
0.0511833 0.0204985 0.0044066 Fold13
0.0373212 0.0000034 0.0031365 Fold14
0.0413991 0.0046571 0.0034307 Fold15
0.0340207 0.0000316 0.0027415 Fold16
0.0774728 0.0009495 0.0046705 Fold17
0.0372096 0.0000343 0.0030407 Fold18
0.0335019 0.0001593 0.0028415 Fold19
0.0375362 0.0000309 0.0031976 Fold20
0.0745354 0.0128687 0.0041776 Fold21
0.1107386 0.0656669 0.0060292 Fold22
0.0470536 0.0001787 0.0040571 Fold23
0.0445965 0.0000835 0.0038359 Fold24
0.0400170 0.1548555 0.0033314 Fold25
0.0543135 0.0017465 0.0044680 Fold26
0.0376358 0.0000684 0.0031992 Fold27
0.1322769 0.1793546 0.0048077 Fold28
0.0403220 0.0025606 0.0032671 Fold29
0.0567234 0.0038390 0.0045119 Fold30
0.0334309 0.0001289 0.0026275 Fold31
0.0333588 0.0000376 0.0026937 Fold32
0.0468875 0.0002011 0.0038989 Fold33
0.0439480 0.0001112 0.0035155 Fold34
0.0336243 0.0001246 0.0027614 Fold35
0.0494977 0.0001541 0.0040599 Fold36
0.0371159 0.0000087 0.0029222 Fold37
0.1041823 0.0170747 0.0045740 Fold38
0.0376756 0.0000468 0.0036517 Fold39
0.0440850 0.0000862 0.0035782 Fold40
0.0237428 0.0000571 0.0019997 Fold41
0.0291976 0.0000430 0.0024803 Fold42
0.0405220 0.0000931 0.0031924 Fold43
0.0243749 0.0000443 0.0028562 Fold44
0.0463547 0.0024547 0.0040121 Fold45
0.0480744 0.0445496 0.0039247 Fold46
0.0300111 0.0000418 0.0026324 Fold47
0.1631688 0.6544475 0.0058770 Fold48
0.0549916 0.0001908 0.0049219 Fold49
0.0876050 0.0139922 0.0041495 Fold50

Table 5.2. Cross-validation Test on Model 4
ggplot(data = reg4.cv$resample) +
  geom_histogram(aes(x = MAE), fill = '#88aab8') +
  labs(title="Distribution of Cross-validation MAE on Model 4",
       subtitle = "K = 50\n",
       caption = "Figure 5.3.2 ") +
  xlab('MAE of Model 4') +
  ylab('Count') +
  plotTheme()

VI. Conclusion

This analysis shows that, serially, demand peaks on weekends differ from those on weekdays, and that, spatially, more trips occur in the center city. To account for the many factors affecting bike sharing, the analysis builds models combining serial, spatial, weather, and user-demographic features. Across all four models, errors are limited, which to some extent indicates accuracy; however, the time-lag and spatial features do not substantially improve the model.

Despite imperfections, Model 1 or Model 4 can be used to address the rebalancing problem, since both capture how trip counts change over time. More dispatch resources should be allocated after 10:00 am in the center city, and re-balancers should be distributed differently between weekdays and weekends.

One limitation of this analysis is that overall trip counts are very small, and it is unclear whether this is related to the pandemic; if so, the rebalancing model would look different. In the future, more features such as traffic characteristics, dispatch resources, and bike conditions should be considered to support better rebalancing plans.